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Abstract 

It is now practically the norm for data to be very high dimensional in areas such as genetics, machine 
vision, image analysis and many others. When analyzing such data, parametric models are often too 
inflexible while nonparametric procedures tend to be non-robust because of insufficient data on these 
high dimensional spaces. It is often the case with high-dimensional data that most of the variability 
tends to be along a few directions, or more generally along a much smaller dimensional submanifold of 
the data space. In this article, we propose a class of models that flexibly learn about this submanifold 
and its dimension which simultaneously performs dimension reduction. As a result, density estimation is 
carried out efficiently. When performing classification with a large predictor space, our approach allows 
the category probabilities to vary nonparametrically with a few features expressed as linear combinations 
of the predictors. As opposed to many black-box methods for dimensionality reduction, the proposed 
model is appealing in having clearly interpretable and identifiable parameters. Gibbs sampling methods 
are developed for posterior computation, and the methods are illustrated in simulated and real data 
applications. 

keywords: Dimension reduction; Classifier; Variable selection; Nonparametric Bayes 



1 Introduction 

Data that are generated from experiments or studies carried out in areas such as genetics, machine vision, 
and image analysis (to name a few) are routinely high dimensional. Because such data sets have become 
so commonplace, designing data efficient inference techniques that scale to massive dimensional Euclidean 

and oven non-Euclidean spaces has attracted considerable attention in the statistical and machine learning 
literatiue. 

When dealing with high dimensional data, it is typically the case that parametric models are too rigid 
to explain all the variability present in the data. Conversely, flexible nonparametric approaches suffer from 
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the well known curse of dimensionality. With this in mind, a common approach is to make procedures more 
scalable to high dimensions by learning a lower dimensional subspace the data are concentrated near. This 
approach is supported by the success of mixture models with a few components in fitting high-dimensional 
data. In particular, consider a mixture of N Gaussian kernels, ^^-^ 7rj7Vm(-; /^j, cr^/m), iij G 3?™. The 
k — N — 1 largest eigenvalues corresponding to the covariance matrix for this type of density will typically 
be very large, while the remaining m — k eigenvalues will all be equal and relatively much smaller. We may 
visualize such data lying close to some affine k dimensional subspace of 3?™ containing the mean and the k 
corresponding eigen-vectors as its directions. If we knew that subspace, we could model the data projected 
onto that subspace with a nonparametric density model, while using some simple parametric distribution on 
the orthogonal residual vector. Robustness would be attained by fitting a fiexible model on only a selected 
few coordinates. 

There is a large literature on the estimation of Euclidean subspaces, affine subspaces, and manifold 
subsets. Many procedures are algorithmic based. Elhamifar and Vidal ^llj propose an algorithmic based 
method of clustering data that lie close to multiple affine subspaces. See the references there in for a nice 
overview of algorithmic type approaches. Because such methods are deterministic, no measures of uncertainty 
are available. A probabilistic modeling approach is proposed by Chen et al. [7] . They employ a fully Bayesian 
model for density estimation of high dimensional data that reside close to a lower dimensional subregion 
(possibly a manifold) of unknown dimension. This subregion is approximated using a nonparametric Bayes 
mixture of factor analyzers in which Dirichlet and beta processes are employed to simultaneously allow 
uncertainty in the number of mixture components, the number of factors in each component and the locations 
of zeros in the loadings matrix. Although their methodology is fiexible, it is very much a complex and over- 
parametrized "black box" leading to challenging computation. 

We propose a fully Bayesian procedure that very flexibly and uniquely identifies a lower dimensional 
affine subspace in a coherent modeling framework. After having identified the subspace and its dimension 
we model the coordinates of the orthogonal projection of the data onto that subspace using an infinite mixture 
of Gaussians while independently using a zero mean Gaussian to model the data component orthogonal to 
that subspace. Among all possible coordinate choices, we prefer isometric coordinates (those which preserve 
the geometry of the space). To obtain such coordinates, an orthogonal basis for the subspace must be 
employed which will require working on the Stiefel manifold (the space of all such basis matrices). In 
addition to interpretability and identifiability, advantages to using an orthogonal basis include equivalence 
of matrix inversion and transpose and faster MCMC convergence. We do not limit the cluster contours to be 



2 



homogeneous, but use a singular value decomposition type sparse representation for the kernel covariance. 
By doing so, we avert the problem of dealing with massive matrices and yet make the model highly flexible. 

An appealing feature to our methodology is that it is not a "black box" , rather nice interpretations 
accompany model parameters. For example, when estimating the affine subspace, which is proved to be 
unique, concern lies in estimating the orthogonal projection matrix associated with that space, and its 
orthogonal shift from the origin. Indeed, under our setting, the subspace turns out to be the /c-principal 
subspace for the distribution, k being the subspace dimension. In this regard, the methodology developed here 
provides a coherent extension of the Principal Component Analysis (PCA) of Hoff [17j to a nonparametric 
setting. The estimation of the projection matrix and orthogonal shift are carried out explicitly under 
appropriate loss functions. 

We also consider building efficient classifiers that entertain a high dimensional feature space. The idea 
is to seek the minimal subspace of the feature space such that the response depends on the predictors only 
through their projection onto that subspace. There has been recent developments in the machine learning 
and statistical communities with regards to building classifiers in the presence of a high dimensional feature 
space. Sun et al. |28j propose a classifier that essentially breaks a complex nonlinear problem into a set of 
local linear problems that scales nicely to a very high dimensional space. They also provide a nice review 
of algorithmic based procedures to building classifiers most of which are black boxes and estimation of a 
principal subspace is not entertained. Recently, Cucala et al. |10j proposed a probabilistic perspective to the 
A:-nearest neighbor classifiers. However, apart from not scaling well to a high dimensional feature space, the 
minimal subspace of the feature space is not estimated. Estimating a minimal subspace of a high dimensional 
feature space has been addressed in a regression setting. Tokdar et al. |29j model the conditional distribution 
of a response given the minimal subspace directly with a Gaussian process. Recently, Reich et al. |23j propose 
a method of sufficient dimension reduction by modeling a conditional distribution directly after placing a 
prior distribution on the minimal subspace (which they call a central subspace). See references there in 
for frequentist approaches to estimating this subspace. Hannah et al. [I3j use Dirichlet process mixtures 
to flexibly model the relationship between a set of features and a response in a generalized linear model 
framework. Shahbaba and Neal [27] focus on Dirichlet process mixture models in a nonlinear modeling 
framework. 

We focus on modeling the joint so that given the subspace, the response and the projection of the features 
onto that subspace follow a nonparametric infinite mixture model while the feature component orthogonal 
to the subspace follows a parametric model independent of the response and the projection. Dependence 
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between the response and features is induced through the mixture distribution. 

The remainder of this article is organized as follows. Section 2 provides some preliminaries, Section 
3 details the class of models to be used for density estimation along with theoretical results dealing with 
large prior support and strong posterior consistency. In Section 4 we investigate the identifiability of model 
parameters and give details of their estimation. Section 5 details computational strategies while Section 6 
outlines a small simulation study and examples. In Section 7 wo develop an efficient classifier and provide 
some examples and a small simulation study in addition to briefly introducing ideas with regards to regression. 
We finish with some concluding remarks in Section 8. 

2 Preliminaries 

A fc-dimensional affine subspace of 5?™ (which is a fc-dimensional Euclidean manifold) can be expressed as 

S = {Ry + e:y€^"'} 

with R being a m x m rank k projection matrix (it satisfies R = R' = R^, rank(i?) = k) and 6 G 
satisfying R9 = 0. Notice that there is a one to one correspondence between the subspace S and the pair 
(i?, 6) with 6 being the projection of the origin into S and R the projection matrix of the shifted linear 
subspace 

L = S-0 = {Ry:ye^"'}. 

The projection of any x G Si™ into S is defined as the xq G S satisfying \\x — xo\\ = min{||a; — y\\ : y G S} 
where || • || denotes the Euclidean norm. For any afiine subspace S as defined above, the solution turns out 
to be xo = Rx + 6. Similarly, the projection of a; e into Lis Xq = Rx, hence the name projection matrix 
for R. We denote the projection of x e into S as Prs{x). 

Each x e 3?™ can be given coordinates x E such that x = Ux + 9 where [/ is a matrix whose columns 
{Ui, . . . , Uk} form a basis of the column space of R. If U is chosen to be orthonormal (i.e., U'U = Ik and 
R = UU'), then the coordinates (x) are isometric. That is, they preserve the inner product on S (and hence 
volume and distances). With such a basis, the projection Prs{x) of an arbitrary x e into S has isometric 
coordinates U'x. Thus, U gives k mutually perpendicular 'directions' to S while 6 may be viewed as the 
'origin' of S. We will call 6 the origin and U an orientation for S. 

The residual oi x G 3?™ (which we denote as Rs{x) = x — Prs{x) = x — Rx — 6) lies on a linear subspace 
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that is perpendicular to L. That is, Rs{x) G S"^ where 

S^ = {il-R)y:ye^"'}. 

Notice that the projection matrix of S'^ is / — i?. Now if we let V denote an orthonormal basis for the 
column space of / — i? (i.e., V'V = I,n-k, VV = I — R), then isometric residual coordinates are given by 
V'x e 3?"-'=. 

For a sample lying close to such a subspace S, it is natural to assume that the data residuals are centered 
around with low variability while the data projected into S comes from a possibly multi-modal distribution 
supported on S. Figure [T] illustrates such a sample cloud. The observations are drawn from a two-component 
mixture of bivariate normals with cluster centers (1,0) and (0, 1) and band-width of 0.5. As a result they 
are clustered around the subspace (line) x + y = I. For a specific sample point x, Prs{x), Rs{x), and are 
highlighted. 




Figure 1: Graphical representation of the affine subspace (S), the orthogonal shift (9), and the projection of 
a point into S (these are the solid dots with particular emphasis given to Rx + 9). 
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If we let Q to be a distribution on 5ft™ with finite second order moments, then for d < m the d principal 
affine suhspace of Q is the minimizer of foUowing risk function 

R{S)= I \\x-Prs{xWQ{dx), (2.1) 

with the minimization carried out over all d-dimensional affine subspaces S. The minimum value of expression 
2.1 turns out to be X^d+i-^j; where Ai > ... > \m are the ordered eigenvalues of the covariance of Q. 
In addition, a unique minimizer exists if and only if > A^+i. If this is indeed the case, then the d 
principal affine subspace {So) has projection matrix R — UU' (here U is any orthonormal basis for the 
subspace spanned by a set of d independent eigenvectors corresponding to the first d eigenvalues) and origin 
^ [I — R)fi (with fi being the mean of Q). Notice that when d — 0, So is the point set /i. 
In the case that d is unknown, we can find an optimal value of d by considering 

i?(d, 5) = /(d) + / \\x-Prs{x)\\^Q{dx), 0<d<m (2.2) 

as a risk function for some fixed increasing convex function /. For / linear, say, /(d) = ad, a > 0, the risk 
has a unique minimizer if and only if A^+i < a < A^; for some d, with Aq = oo and A„i+i = 0. Then the 
minimizing dimension do is that value of d while the optimal space So is the do principal affine subspace. 
We will call do the principal dimension of Q. For the observations in Figure [T| the principal dimension is 
do — I with principal subspace 

{ \ -1/2 1/2 

Before detailing general modeling strategies, we introduce notation that will be used through out. By 
(5) we denote the space of all probabilities on the space S. M{m, k) will denote real matrices of order mxk 
(with M{m) denoting the special case of m = A;), M+(m) will denote the space of all m x to positive definite 
matrices. For U G M{m, k), C{U) and M{U) will represent the column and null space of U respectively. We 
will represent the space of all to x to rank k projection matrices by Pk,m- That is, 

Pfe ,„ = {Re M (m) : R = R' ^ R\ rank(i?) = k}. 

One important manifold referred to in this paper is the Steifel manifold (denoted by Vk,m) which is the 
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space whose points are A:- frames in 3?™ (here fc-frame refers to a set of k orthonormal vectors in 3?™). That 
is, 



We denote the orthogonal group {A G 3?"*: A' A = /„} by 0(m) which is Vm.m- The space Vk^m is a 
compact non-Euchdean Ricmannian manifold. Because M(m, k) is embedded in Euclidean space, it inherits 
the Riemannian metric tensor which can be used to define the volume form, which in turn can be used as the 
base measure to construct a parametric family of densities. Several parametric densities have been studied 
on this space, and exact or MCMC sampling procedures exist. For details, see Chikuse [9]. One important 
density which we will be using as a prior is the Bingham-von Mises-Fisher density which has the expression 



The parameters are A S M{k,m), B e M{k) symmetric and C G M{m), while etr denotes exponential 
trace. As a special case, we obtain the uniform distribution which has the constant density l/Vol(Vjij_„i). 

3 Density model 

Consider a random variable X in Sfi™. Let there be a fc dimensional affine subspace 5, < fc < m, with 
projection matrix R and origin 9 such that the projection of X into this subspace follows a location mixture 
density on the subspace (with respect to its volume form) given by 



where y S is the projection of x with parameters Q G A4{S), U G Vk.m, and A a m x m positive semi- 
definite (p.s.d.) matrix such that U'AU G 7\/+(fc). When k = 0, S denotes the point set {9} and Y = 9. 
Note that the density expression depends on U only through UU' . A general choice for A besides being 
positive definite (p.d.) could he A = UqTiq^Uq for some specific orientation Uq and p.d. Sq G Af^(fc). As 
a result, the isometric coordinates U^X of Prs{X) follow a non-parametric Gaussian mixture model on 3?*^ 
given by 



14, 



{A G M{m,k) : A' A = h}. 



BMF{x; A, B, C) cx etr(A'a; + Cx'Bx). 



Y = PrsiX) ^ f (2^)-'=/2|C/'At/|i/2 exp{-l{y - wyA{y - w)}Q{dw) 




(3.1) 
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Here ^ — UqW for w E S. Independently, let the residual Rs{X) follow a mean zero homogeneous density 
on S-^ given by 

i?5W^a-('"-'=)exp{-Mf}, 

X S S-^ and parameter cr > 0. li k = m, then S-^ = {0} and Rs{X) =0. As a result, with any orientation 
V E Vm-k.m for 5^, the isometric coordinates V' X of Rs{X) follow the Gaussian density 

V'X ^ Nra-ki; V'0, fT2/„,_fe). (3.2) 



Combine equations (3.1) and (3.2) to get the full density of X as 



X^f{x-Q)= N^(x■^{^Ji),Y.)P{d^i), (3.3) 
0(/i) ^Ua^i + e, E - [/o(I]o - fT'4)(7o + <T'/m, (3.4) 



with parameters 8 = (fc, C/o, Sq, cr, P). Here C/q E Vk.m and e satisfies ^7Q^^ = 0. The affine subspace 
S has projection matrix R = UqUq and origin 6. For fc = 0, /(x;0) = Nm{x;9,<T^Im)- Using a flexible 
multimodal density model for a few data coordinates (which are chosen using a suitable basis) and an 
independent centered Gaussian structure on the remaining coordinates allows efficient density estimation on 
very high dimensional spaces. 

A common choice of nonparametric prior on P can be a full support discrete model, such as a Dirichlct 
process, which allows clustering of the data around S. An alternative way to identify the intercept 9 would 
be to set it equal to E{X). However, this would require the prior on P to be such that ji ^ J iJ,P{dfi) = 
making the Dirichlet process prior inappropriate. For this reason, we set 9 to be the origin of S instead. 

With Eg p.d. and cr^ > 0, the within cluster covariance E lies in A/^(m) and has a sparse representation 
without being homogeneous. The residual variance cr^ dictates how "close" X lies to S, with = implying 



that X E S. In (3.3), one may mix across Eq by replacing P{dfi) by P{dfi dEo) and achieve more generality. 



To make model (3.3) even more sparse, without loss of generality, we can allow Eq to be a p.d. diagonal 
matrix. To prove that we do not lose any generality, consider a singular value decomposition (s.v.d.) of 
a general Eq, say Eq = ODO', O E 0{k), and replace Eq by diagonal D, and Uq by UqO' . If P is 
appropriately transformed, then the model is unaffected. With a diagonal Eq, the within cluster covariance 
has k eigenvalues from Eq and the rest all equal to . The columns of Uq are the orthonormal eigenvectors 
corresponding to Eq. 
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It is easy to check that S is the fc-principal subspace for the model, if and only if Sq + Jvj^kifJ- ~ ^ 
Ji)'P{dii) > a^Ik- Here Ai > refers to A — i? being p.d. This holds, for example, when Eq ^ cr'^/fc and P 
is non-degenerate. Further under the model, k is the principal dimension of X for a range of risk functions 



as in (2.2) with linear /. 



3.1 Weak Posterior Consistency 



Consider a mixture density model / as in (3.3). Let 15(3?™) denote the space of all densities on . Let 



denote the prior induced on 2?(3fJ™) through the model and suitable priors on the parameters. Theorem 3.1 
shows that satisfies the KuUback-Leibler (KL) condition at the true density /t on 3?™. That is, for any 
e > 0, n/(i4rc(/f)) > 0, where K^{ft) = {/; KL{ft;f) < e} denotes a e-sized KL neighborhood of ft and 
KL{ft-f) = / log -^ftdx is the KL diver gence. As a result, using the Schwartz theorem [5S], weak posterior 
consistency follows. That is, given a random sample X„ — Xi, . . . ,X„ i.i.d. ft, the posterior probability of 
any weak open neighborhood of ft converges to 1 a.s. ft- 

Let p{k) denote the prior distribution of k. We consider discrete priors that are supported on the set 
{0, . . . , m}. Let TTi{Uo, 9\k) denote some joint prior distribution of Uq and 9 that has support on {{Uq, 6) G 
14, m X 3^™ '■ U'qO = 0}. As previously recommended, we consider a diagonal Sq = diag{af, . . . , af) and set 
a joint prior on the vector cr = (cr, cti, . . . , cr/j) G (3?^)'^^^ that we denote with 7r2(cr|fc). Further, we assume 



that parameters {Uq, 9), cr, and P are jointly independent given k. That said, Theorem 3.1 can be easily 



adapted to other prior choices. We also consider the following reasonable conditions on the true density ft- 
Al: < ft{x) < A for some constant A for aU x G 3?™. 
A2: \ J log{ ft{x)}ft{x)dx\<^. 

A3: For some S > 0, J log j^ft{x)dx < oo, where fs{x) = inf,y:||,y_^||<5 ft{y). 
A4: For some a > 0, / \\xf'^^+°''^"' ftix}dx < oo. 

Theorem 3.1. Set the prior distributions for k, (Uq, 9), cr, and P to those described previously such that 
p{m) > 0, 7r2(3fi+ x (0, e)"'|/c = m) > for any e > 0, and the conditional prior on P given k — m contains 
Pf^ in its weak support. Then under assumptions A1-A4 on ft, the KL condition is satisfied byllf at ft. 

Proof. The result follows if it can be proved that Ilf{K^{ft)\k = m,Uo) > for all e > and Uq G 0(m), 
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because then 



^fiK.ift)) > p{m) / UfiK,ift)\k = m, Uo)d7riiUo\k - m) > 

JO(m) 



Now, given k = m and Uq, density (3.3) can be expressed as 



f{x;Q,^)= N,^{x-u,^)Q[dv), (3.5) 



with Q — P o (f)~^. Here (j){x) — Uqx, and E = L/o^oC^o- The isomorphism cj) : 5R™ ^ SR™ being continuous 
and surjective ensures the same for the mapping P i-^ Q. This in turn ensures that under the Theorem 
assumptions on the prior, the prior on P and a induces a prior on Q that contains Pf_^ in its weak support and 
an independent prior on S which induces a prior on its maximum eigen-value that contains in its support. 
Then with a shght modification to the proof of Theorem 2 in Wu and Ghosal [32 , under assumptions A1-A4 
on ft , we can show that ft is in the KL support of Ily . □ 

3.2 Strong Posterior Consistency 



Using the density model (3.3) for ft, Theorem 3.5 estabhshes strong posterior consistency, that is, the 



posterior probabihty of any total variation (or Li or strong) neighborhood of ft converges to 1 almost surely 
or in probability, as the sample size tends to infinity. The priors on the parameters are chosen as in Section 



3.1 To be more specific, the conditional prior on P given fc (fc > 1) is chosen to be a Dirichlet process 
DP{wkPk) {wk > 0, Pk & A^(5R'^)). The proof requires the following three Lemmas. The proof of Lemma 



(3.2) can be found in [T], while the proofs of Lemmas (3.3) and (3.4) are provided in the appendix 



In what follows Br^m refers to the set {x e 5R™ : ||a;|| < r}. For a subset V of densities and e > 0, the 
Li-mctric entropy N{e,'D) is defined as the logarithm of the minimum nmiibcr of e-sized (or smaller) Li 
subsets needed to cover V. 

Lemma 3.2. Suppose that ft is in the KL support of the prior Hf on the density space I?(5ft™). For every 
e > 0, if we can partition 2?(3fi™) as 2?^ U X'^'^ such that N{e,V%)/n — > and Pr(i:),f |X„) — > a.s. 
or in probability Pf^, then the posterior probability of any Li neighborhood of ft converges to 1 a.s. or in 
probability Pf^ . 

Lemma 3.3. For positive sequences hn ^ and r„ — )• oo and e > 0, define a sequence of subsets of'D{W^) 
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Vt, = {/(se) : e e K), K = {©: min(<T) > h^, \\d\\ < r„,P{B'f^,) < e} 



with f (■■.&) asm (3. 3|. Set a prior on the density parameters as in Section 3.1 Assume that supp{n2{-\k)) G 
[0, for some ^ > for all < k < m. Then N{e, I?^) < C(r„//i„)™ where C is a constant independent 

ofn. 



Lemma 3.4. Set a prior as in Lemma 3.3 with a DP{wkPk) prior on P given k, k > 1. Assume that the 
base probability Pk has a density pk which is positive and continuous on SR*"'. Assume that there exist positive 
sequences hn — > and rn — > oo such that 

Bl : Jim^n<5->-'=exp(-r,V8A^) = 

holds where 

Skn = mf{pfe(M) : A* G IImII <A + r„/2}, k = l,...,m. 

Also assume that under the prior 7r2(-|fc) on cr , Pr(min((T) < /i„|fc) decays exponentially. Then under the 
Assumptions of Theorem\3.1\ for any e > 0, fc > 1, 



%{Pr(P(B,^„,)>e|fc,X„)}^0. 

If^l is strengthed to 

oo 

Bl' ■.Y.^^krlh-'expi-rl/SA') < oo, 

n=l 

and the sequence rn satisfies rn^^^"*""^™ < cxi with a as in Assumption A4, then the conclusion can be 

strengthed to 

oo 

Y,Ef,{Pr{P{B'^^,,)>e\k,Xn)}<^. 

n=l 

With these three Lemmas we are now able to state and proof the theorem that ensures strong posterior 
consistency is attained. 



Theorem 3.5. Consider a prior and sequences hn and rn for which the Assumptions of Lemma \ 3.4\ are 
satisfied. Further suppose that ?T.^^(r„/ft,„)'" — > 0. Also assume that the sequence rn and the prior 7ri(-|fc) 
on {U,9) satisfy the condition Pr(\\0\\ > r„|fc) decays exponentially for k < m ~ I. Assume that the true 
density satisfies the conditions of Theorem\3.1\ Then the posterior probability of any Li neighborhood of ft 
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converges to 1 in probability or almost surely depending on Assumption Bl or Bl'. 



Proof. Theorem 3.1 implies that the KL condition is satisfied. Consider the partition I?(5R'") — Vf^ U 



Then N{e,VfJ/n — > 0. Write 

Pr(I?-|X„) - Fr({/(.;e) : 9 G i/,r}|X„), 

where 

= {© : niin(ff) < /i„} U {6 : H^H > r„} U {6 : P(fi;;„fc) > e}. 

The posterior probabihty of the first two sets above converge to a.s. because the prior probabihty decays 
exponentially and the prior satisfies the KL condition. Note that 

m 

Pr{{e : > e}|X„) <Y.Pr{{e : P{B'^^^) > e}|X„,fc = j) 



and Lemma 3.4 implies that this probability converges to in probability/a.s. based on Assumption Bl/Bl'. 



Using Lemma 3.2 the result follows. □ 



Now we give an example of a prior that satisfies the conditions of Theorem |3.5| Any discrete distribution 
on {0, . . . , m} having m in its support can be used as the prior p for k. Given k (k > 1), we draw Uq from a 
density on Vk^m- Given k and Uq, under tti, 9 is drawn from a density on the vector-space Af{Uo) if fc < m. 
If k = TO, then 9 — 0. When fc < m, we set 9 — r9 with r and 9 drawn independently from !R+ and the set 
{9 e : \\9\\ ~ 1,9'Uo = 0} respectively. The scalar r" is drawn from a Gamma density for appropriate 
a > 0. As a special case, a truncated normal density can be used for 9 when 9 is drawn uniformly, a = 2 
and r^ ^ Gam{l, ao), cto > 0. Then 9 has the density 



-('"-'^)exp-l||0|p/((?U = O) 

Z(7, 



with respect to the volume form of A/'([/o)- Given fc, cr follows 7r2 supported on [0,v4]'^+^. Under tt-z, the 
coordinates of cr may be drawn independently with say, aj^ following a Gamma density truncated to [0, A]. 
If reasonable, assuming ai = . . . = ak = cr with a^^ following a Gamma density will simplify computations. 
That said, a Gamma distribution only satisfies the conditions of Theorem |3.1| when to > 2. To satisfy the 
conditions of Theorem |3.5| a truncated transformed Gamma density may be used. That is, for appropriate 
6 > 0, we draw cr^'' from a Gamma density truncated to [0, A]. Given fc, fc > 1, P follows a DP{wkPk) 



12 



prior. To get conjugacy, we may select Pk to be a Gaussian distribution on Sft'^ with covariance r^Ik. With 
such a prior the conditions of Theorem 3.5 are satisfied if we choose a,b,T and A such that > AA^, 



a < 2(1 + a)m and a ^ + b ^ < m ^. This result is available from Corollary 3.6 the proof of which is 
provided in the Appendix. 

Corollary 3.6. Assume that ft satisfies Assumptions A1-A4. Let II f be a prior on the density space as in 



Theorem 3.5 Pick positive constants a-,b, {t^Y^^^ and A and set the prior as follows. Choose 7ri(.|fc) such 
that for k < m ~ 1, H^H" follows a Gamma density. Pick 7r2(.|fc) such that cr, Ci, . . . , CTfe are independently 
and identicaly distributed with following a Gamma density truncated to [0,^]. Alternatively let a = 
(Ti = . . . = CTfe with a distributed as above. For the DP{wkPk) prior on P , k > \, choose Pk to be a normal 
density on SR'^' with covariance T^Ik- Then almost sure strong posterior consistency results if the constants 
satisfy t| > iA^ , a < 2(1 + a)m and 1/a + l/b < 1/m. 

A multivariate gamma prior on cr satisfies the requirements for weak but not strong posterior consistency 



(unless m — 1). However that does not prove that it is not eligible because Corollary 3.6 provides only 
sufficient conditions. Truncating the support of cr is not undesirable because for more precise fit we are 
interested in low within cluster covariance which will result in sufficient number of clusters. However the 
transformation power b increases with m resulting in lower probability near zero which is undesirable when 
sample sizes are not high. 

In [S], a gamma prior is proved to to be eligible for a Gaussian mixture model (that is, k = m) as long 
as the hyperparameters are allowed to depend on sample size in a suitable way. However there it is assumed 
that ft has a compact support. We expect the result to hold true in this context too. 

4 Identifiability of Parameters 

In many applications, the goal may not be density estimation but estimating the low dimensional set S and 
its dimension. To do so S must be identifiable. That is, there must be a unique S corresponding to the 
model (3.3). Denoting by P/, the distribution corresponding to /, it follows that 



P/ =iV„(0,I])*(Po^-i), (4.1) 
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with * denoting convolution. Now let $p(t) be the characteristic function of a distribution P, then (4.1) 
implies that the characteristic function of / (or Pf) is 

^f{t) = exp(-l/2t'Et)$poA-i {t), te W\ (4.2) 



Once we let P to be discrete, (4.2 1 suggests that S and P o ^ can be uniquely determined from /. Now 
: 5R'= — ^ SR", 0(3?'=) = 5 and P o is the distribution of <j){Y) with Y ^ P. It is a distribution on 3?™ 
supported on the k dimensional afhne plane S. To identify S and k, we further assume that the ajjine support 
asupp(P) of P is 3?*^. We define asupp(P) as the intersection of all affine subspaces of 3?'= having probability 
1. It is an affine subspace containing supp(P) (but may be larger). In other words, we use a prior for which 
P is discrete and asupp(P) ~ 3?'= w.p. 1. The Dirichlet process prior on P given k with a full support base is 
an appropriate choice. Then, from the nature of (j), asupp(Po (j>^^) is an affine subspace of 3?™ of dimension 
equal to that of asupp(P). Since asupp(P o (f>~^) is identifiable, this implies that k is also identifiable as 
its dimension. Since S contains asupp(P o (p^^) and has dimension equal to that of asupp(P o 0^^), hence 
S — asupp(P o Hence we have shown that the (sub) parameters (E, k,S,P o 0~^) are identifiable once 

we set a full support discrete prior on P given k. Then UqU[^ and 9 are identifiable as the projection matrix 
and origin of S. However P and the coordinate choice (hence Uq) are still non-identifiable. However, if we 
consider the structure S — UoY^oU'^ + (7'^{I,n — UqUq) with a diagonal Eo and impose some ordering on the 
diagonal entries of Eqj then the columns of Uq become identifiable up to a change of signs as the eigen-rays. 



4.1 Point estimation for subspace S 

To obtain a Bayes estimate for the subspace S, one may choose an appropriate loss function and minimize the 
Bayes risk defined as the expectation of the loss over the posterior distribution. Any subspace is characterized 
by its projection matrix and origin. That is, the pair (P, 9) where R G M{m) and 9 £ 3?™ satisfy R — R' — 
and R9 = 0. We use Sm to denote the space of all such pairs. One particular loss function on Sm is 

Ll((Pl,0l), (P2,02)) = - i?2|r + ll^^l - ^211', iR^,0^) G Sm- 

For a matrix A — ((a^j)), its norm-s quared is defined as \\A\\'^ ^ J^ij "-Ij ^ Tr{AA'). We find the average of 
Li over repeated draws of (P2, ^'2) from their posterior and choose the value of (Pi, ^i) for which the average 
is minimized (if a unique minimizer exists). Then the subspace S is estimated as {Rix + 9i : x £ 3?'"}. It 
has dimension equal to the rank of Pi . 
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If the goal is to estimate the directions of the subspace, we may instead use the loss function 



L2{{Ul,Wi),{U2,W2)) = \\Ui - U2\\^ + {W, - W2)^ {U,,W,) £ S,n2- 

Here the m x m matrix C/j has the first few columns as the directions of the corresponding subspace Si, 
the next column gives the direction of the subspace origin 9i and the rest are set to the zero vector while 
Wi — \\6i\\. Therefore 

r / 

/ 



Sm2 = < {U, w) e M(m) X n+ : U'U 







We find the minimizer (if unique) {Ui,wi) of the expected value of L2 under the posterior distribution of 
{U2,W2) and set the estimated subspace dimension k as the rank of C/i minus 1, the principal directions 
consisting of the first k columns oi Ui and the origin as Wi times the last column. Since the k orthonormal 
directions of the subspace are only identifiable as rays, one may even look at the loss 

i=i 

where 



(y,02) e Smz = <^ e M(m) X : U'U 





0) 




(: 













Theorems 4.1 and |4.2| (proofs of which can be found in the appendix) derive the expression for minimimizer 
of the risk function corresponding to Li and L2 and present conditions their uniqueness. Hereby we denote 
by Pfi the posterior distribution of the parameters given the sample. It is assumed to have finite second 
order moments. For a matrix A, by v4(i.) we shall denote the submatrix of A consisting of its first k columns. 

Theorem 4.1. Let fi{R,9) = /^^^ Li((i?, 6*), (i?2, 6'2))dP„(i?2, 6*2), iR,9) e S. This function is mini- 
mized by R = UjU'j and 9 = {I - R)92 where R2 = jM(m) R2dPn{R2) and ^2 = 62dPn{(^2) are 
the posterior means 0/ i?2 and ^2 respectively, 2R2 — ^2^2 ~ EjLi'^j^i^j; Ai > . . . > A„j is a s.v.d. of 
2R2 — ^2^2' and k minimizes k — X]^=i '^^i {0, . . . , m}. The minimizer is unique if and only if there is a 
unique k minimizing k — X]j=i and Xk > X^+i for that k. 

Theorem 4.2. Let f2{U,w) = /^^^^ L2{{U,w), {U2,W2))dPn{U2,W2), {U,w) £ Sm2- Let w and U denote 
the posterior means of W2 and U2 respectively. Then /2 is minimized by w — Hj and any U — [Ui , 0] , where 
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Ui e Vk+i,m satisfys U(^k+i) = f^i(t^(fc+i)f^(fc+i))^^^; and k minimizes g[k) = k- 2TY(U[f^^^~^U(^k+i)Y^'^ over 
{0, . . . , m — 1}. The minimizer is unique if and only if there is a unique k minimizing g and C/(fc_|_i) has full 
rank for that k. 

5 Posterior Computation 

We now present an algorithm to sample from the joint posterior distribution of = (fc, Uq, 9, Sq, ct, P) and 
as a result the density of X, given iid realizations Xi, . . . , X„. Since exact sampling is not possible, we resort 
to MCMC draws from the posterior. We first present an algorithm with k being treated as a fixed known 
quantity. We then generalize the algorithm to allow unknown k. In both cases, a straight forward Gibbs 
sampler can be used. 

5.1 MCMC algorithm for the fixed k 

We use a Dirichlct process (DP) prior for P (i.e., P ^ DP{woPa)). For simplicity and to preserve conjugacy 
we set Pq — iVfc(mp, S^) with wq — 1. We employ the stick breaking representation of the Dirichlet process 
(Sethuraman [26 ) so that P — ^'jLi WjS^. where /ij is drawn iid from Po and Wj = Vj Jl£<j(l ~ "^i) with 
Vj ~ Beta{l, Wq). After introducing cluster labels Si, . . . , Sn, the likelihood becomes 

n 

fix; Uo, e, Eo, P, M, ^) - n ^s,^n^{x^■, Uofis, + 6, S) (5.1) 

1=1 

n 

= n ws^N^{U'^x,; ^Jis.,^a)Nr,^-k{V'x,■, Ve, a^I,n-k) (5.2) 

i=l 

where once again S = UoT,oUQ+a'^ {I„i — UoUq) . After prior distributions for (Uq, 6, Sq, ct, /i) are appropriately 
selected (details of which are given concurrently within the description of the algorithm) it is now possible 
to describe an algorithm that can be used to construct an MCMC chain that provides draws from the joint 
posterior distribution of interest by cycling through the following steps. 

Step 1. Let 7r(C/o) denote a prior distribution for Uo G Vk.m- Using straightforward matrix algebra it can be 
shown that the full conditional of C/q is 

n n 
i=l i=l 

cx etr{F{[/o + F2U'oFMAUq), (5-3) 
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where Fi — ^i^^'s^)^o^ ^ -^2 = 5(0" '^h — Sq^), and F3 = (5-3) etr(v4) denotes 



exp{tr{A)). Thus, if one selects a matrix Bingham-von Mises-Fisher prior distribution for Uq (the Uniform 
distribution on the Stcifel manifold being a special case) , then the full conditional of Uq is a matrix Bingham- 
von Mises-Fisher distribution on the space Uq9 — 0. Strategies for sampling from matrix Bingham-von 
Mises-Fisher are developed in Hoff (TS]. A straightforward extension of their work can be implemented to 
sample from a matrix Bingham-von Mises-Fisher that has U^d = as a constraint. 

Step 2. As discussed in Section 3.2 a good prior choice for 6* is a truncated normal 9 ^ Nminig, Sq)I\U'qO = 0]. 
The full conditional under this prior is the following truncated multivariate normal 

[0\-]^N^{mlS;)I[Ul,9 = O], (5.4) 

where Sg = (n'S^^ + Sg^^)^^ and nig = Sg (Y,^^ X]"=i ^« + Sg^mg). 
Notice that if W is an orthonormal basis of N{U'q), then there exists a 6* € Sfi™^*'' such that — W9 and 



Nm-k{Wml,W' SgW). This fact can be exploited to sample from (5.4) 



Step 3. Update Si for i = 1, 2, . . . , n by sampling from the multinomial conditional posterior distribution 
Pr{Si = oc Wj exp{-l/2{UQXi ~ ^Ij)'T.q'^(U'qXi - fij)}, j 1, . . . , 00. 

To make the total number of states finite the block Gibbs sampler of Ishwaran and James |19| may be 
implemented. Alternatively, the slice sampling ideas described in Yau, Papaspiliopoulos, Roberts, and 
Homes [33], Walker |3Tj, or Kalli, Griffin, and Walker EUj could be used. The remainder of the algorithm 
is described from the perspective of using a block Gibbs sampler which requires truncating the number of 
atoms to N. 

Step 4. Update the DP atom weights by setting wj = Vj ni=i (1 ~ ^Oj J — I7 • ■ ■ i A^ after drawing 



[VI 



-] - Beta{l + nj,wo + Y, ^^^^ > 



with Uj — J2i H^i — j) setting ujv — 1- 
Step 5. Update the DP atoms {/ij : j — 1, . . . , N} independently by sampling from 

[f,,\~]r^Nk{m;,s;), 
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where S; = (n,So ^ + S^Y' and = S;{U(,i:o' E + ^;'"^/^)- 
Step 6. Using a cr~^ ~ Ga(a, b) prior, cr~^ can be updated using 

[<7-2|-] ~ Ga(-n(m - fc) + a,6 + - ^ x',Xi + "^6' 9 - -Y^^W^Xi - e'J^Xi) 

i—l i=l i=l 

Under the simpUfying assumption that Sq = o'^/fe the full conditional of cr~^ becomes 

1 1 " 

[a-^\-]'^Ga.{-nm + a,b+-Y,{xi-Uofis,-Oy{xi-Uofis,-0)) 

Step 7. Using a truncated Gamma distribution for aj'^ (i-c, cr"^ ^ Gam{a,b)I[aJ^ G [0,v4]]) allows one to 
update using the following truncated Gamma distribution. 

[a-'\-] ~ GAM(| + a, 6 + - ^(t/^o;, - MsJ,')/^"^ e [0, A]]. 

i=l 

Reasonable starting values can decrease the number of MCMC iterates discarded as burn in and therefore 
may be desirable. For Uq, the first k eigen- vectors of the sample covariance matrix can be used. For 6 one 
may use ~ UaU'g)x where Ug denotes the starting value for Uq. The initial labels {Si) and coordinate 
cluster means {^j) can be obtained by applying a k-means algorithm to U'^Xi. 

5.2 MCMC algorithm for k unknown 

In the case that k is unknown, a prior distribution needs to be assigned to k and C/q € 0{m). In what follows, 
to denote the fcth coordinate and the 1st k coordinates of Hj we use fijk and Mj(fe) respectively. Similarly, 
let f/o(fe) represent the first k columns of Uq while J7o(-fe) will represent the remaining m — k columns. 
After introducing cluster labels, the full posterior is proportional to 

n 

Tr{w,iJ.,a,T,o,Uo,e,k,S) a JJ wSi-/Vm(a;i; C/o(fe)MSi(fe) +6,T,). 

i=l 

Here tt is a general expression for the prior. The first k columns of the mxm matrix Uq explain the subspace 
directions and the first k coordinates of Hj the cluster locations. 

Allowing k to be unknown requires altering steps 1 and 5 of the MCMC algorithm described in the previous 
section and adding an additional step. We first describe the additional step and then the adjustments to 
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steps 1 and 5. Continuing from step 7 from the previous section we add 
Step 8. Update k by drawing a value for k from the fohowing complete conditional 

n 

Pr{k = e\-) cx p{i) l[ Nm{x,- Uoie)fis,ii) + 0, E) for ^ = 1, . . . , m - 1. (5.5) 

1=1 

When the data dimension m is very high, computing all m — 1 probabilities can become computationally 
expensive. An approach to reduce the number of states would be to introduce a slice sampling variable u 



drawn from Unif (0,1). In this setting we replace p{k) in (5.5 1 by I{u < p{k)). This means that k will 
be drawn from the set {k: p{k) > u\ and u ~ Unif(0,p{k)). Updating the upper bound for the subspace 
dimension (A') can be done by drawing u ~ Unif{0,p{k)) and setting K = max{A: < m : p{k) > u)}. 

Step lb. Use the complete conditional derived in step 1 from Section 6.1 to update f/o(A;), then draw ?7o(_fe) = 
[Uok+i, . . . , Uqk] from Tr{Uo(-K}\Uok) such that ?7o(-fe)^ = 0- 

When a uniform prior is being considered, steplb requires one to sample uniformly from VK^k.m perpen- 
dicular to the column space of [C/q/cj ^] = Ug. As discussed in Chikuse[8j, U* is a uniform sample from VK-k.m 

a mx [K — k) matrix of independent standard normal random variables. To ensure 
that U* e N{U'g) first project T into N{U'g) by setting T* = {I - UgU'g)T. Then U* = t* {T*' T*)-^/^ is a 
uniform draw from VK-k,m perpendicular to column space of Ug. If tt{Uq) is not a uniform distribution on 
0{m) see Hoff [18] for sampling strategies. 

Step 5b. Use the full conditional found in step 5 from Section 6.1 to update ^J■j{k)■ Then draw fj,jk+i, . . . , /ij^r 
from their respective prior distributions. 

With k unknown, the MCMC chain tends to get stuck on certain values of k for many iterations. The 
stickiness occurs because the probabilities in step 8 are computed for all i — 1, . . . , K using a Uq that was 
updated for a particular value of k. To make the chain less sticky, we employ adaptive MCMC methods as 
outlined in Roberts and Rosenthal [24j. We applied the adaptation to step 8 and step 5 of the algorithm. 



Specifically, we raised each of the un-normalized probabilities in (5.5) to the 1 — exp(— O.OOOli) power 

with 



5.1 



(where t = denotes the t^^ MCMC iterate) and replace 5** found in step 5 of Section 

(1 + 100 exp(— 0.001t))S'* . In this way, the space of cluster locations is initially more thoroughly explored. 
Notice that the adaptation vanishes at an exponential rate, which guarantees that the proper regularity 
conditions hold. 
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6 Simulation Study 

To assess the proposed methodology's density estimation abihty we conducted a smah simulation in which 
a density is estimated using observations in iJi™ originating from the following finite mixture 

c+l 

X ^^nhN^{r]h,(j'^I). (6.1) 

h=l 

Here ijh is a vector of zeros save for the hth entry which is 1. We considered the following three factor's 
influence on the density estimate. 

1. Bandwidth (setting cr^ = 0.01, = 0.05, and = 0.1) 

2. Sample size (setting n = 50, n = 100, n = 200) 

3. Dimension of the afhne subspace (considering k — 2 and fc = 5). 



To show that (6.1 1 falls into the current class of models, consider the case of fc = 2 and m = 100. For this 
case we have the 100-dimensional vector 9 = (1/3, 1/3, 1/3, 0, ... , 0)'. Further one possible representation of 
the 100 X 2 dimensional Uq is 



, 1/V2 -1/V2 ... , 
Uo^\ _ 1 . (6.2) 

1/^6 1/V6 -2/VG ■•■ < 

As competitors, we considered a finite mixture with f{x) = "^^-^m) infinite 

mixture f{x) = X^hLi ''^hNm{l^h-:<^'^Im)- The number of components employed in the finite mixture were 3 
and 6 for the two respective afhne subspace dimensions considered. For each synthetic data set created, 100 
observations were generated to assess out of sample density estimation. To compare the density estimates 
between the procedures employed, we used the following Kullback-Leibler type distance 



, D . T /lOO 100 \ 

^ E T E E log - E i°g /*(^«) 

d=l t=l \l=l 1=1 ) 



(6.3) 



Here /o denotes the true density function, d is an index for the D — 2h datasets that were generated, and 
is the ^th out of sample observation generated from the dth data set and /( is the estimated density. 
For each of the 25 generated data sets, a density estimate was obtained using the proposed method with 
fc unknown and for k = I, k — 2, and fc = 5. We entertained a discrete uniform and stick-breaking type prior 
for k with no appreciable difference in parameter estimation. We set ai — . . . ,ak = o. For each scenario 
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1000 MCMC iterates were used to approximate the density. A burn-in of 1000 was used when k was fixed. 
When k was considered an unknown a burn-in of 10,000 was used with a thin of 100. Convergence was 
monitored using trace plots of the coUected MCMC iterates. 



The value of equation (6.3) for each scenario considered averaged across the 25 datasets can be found 
in Table [TJ Under the column "Unknown fc" can be found the results when k was treated as an unknown. 
The results from the method when k is fixed at a specified value can be found under one of the three "fc =" 
columns. Results from the finite mixture and infinite mixture are under the columns "Fin Mix" and "Inf 
Mix" . 



Table 1: Results of the KuUback-Liebler type distance comparing estimated densities from each of the 
procedures considered in the simulation study to the density used to generate data. 



True k 




n 


Unknown k 


k = 1 


fc = 2 


fc = 5 


Fin Mix 


Inf Mix 




0.01 


50 
100 
200 


582.98 
274.76 
139.21 


1557.39 
1494.65 
1474.90 


392.84 
205.49 
106.06 


412.77 
214.32 
111.85 


2580.81 
1539.74 
165.92 


2612.92 
1619.44 
1429.98 


2 


0.05 


50 
100 
200 


590.24 
271.79 
128.30 


421.93 
371.65 
315.85 


314.44 
172.61 
96.37 


394.53 
192.39 
105.34 


710.46 
465.87 
153.54 


714.26 
499.58 
160.66 




0.1 


50 
100 
200 


589.01 
280.99 
134.07 


232.33 
189.05 
162.34 


250.50 
154.91 
87.55 


365.38 
201.62 
104.65 


426.69 
320.02 
160.54 


426.29 
324.92 
176.29 




0.01 


50 
100 
200 


2292.44 
2075.99 
2138.87 


2645.34 
2564.26 
2503.26 


2268.70 
2164.32 
2065.54 


1015.80 
500.65 
256.78 


3003.87 
2341.99 
1646.43 


3029.25 
2838.46 
2046.68 


5 


0.05 


50 
100 
200 


872.18 
604.07 
506.53 


646.12 
604.73 
550.92 


654.20 
556.36 
489.39 


714.96 
421.40 
231.47 


798.29 
676.65 
460.85 


801.22 
690.04 
512.93 




0.1 


50 
100 
200 


773.15 
431.56 
283.02 


315.85 
294.42 
246.20 


357.02 
309.34 
237.94 


484.87 
358.66 
206.01 


447.79 
351.17 
286.96 


456.62 
353.89 
288.10 



Generally speaking, the procedure outlined in Section 3 does a much better job at recovering the true 
density relative to the mixtures. This is the case even if fc is fixed at the wrong value. That said, as expected, 
fixing fc at the true value provides the best results. The only instances in which the finite mixture estimated 
the density more accurately than our density estimator is when the dimension of the affine subspace is set 
to 5 and the sample size is small. However, even in small samples, if fc is fixed at the correct value, then 
the density is recovered more accurately using our procedure compared to mixtures. Also, it appears as 
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cr^ increases, then cluster separation diminishes and estimating k is more difficult. Hence the varying k 
procedure does not perform as well in estimating the density (which is to be expected) but still out performs 
the mixtures. In addition, as expected larger sample sizes are conducive to better density estimation as the 
KuUback-Leibler type distance generally gets smaller as n increases. 



7 Nonparametric Classification with Feature Coordinate Selection 

We consider a categorical Y that takes on values from the set {1, . . . ,c}. The goal of classification is to 
identify the class to which Y belongs using m characteristics of Y. These characteristics are typically 
denoted by X £ SR™. Because the association between X and Y may not be causal, our approach is to model 
X and Y jointly and from the joint derive the conditional. Letting Mc{y; v) = Jlfci we consider the 

following joint model 



{X,Y) ^ f{x,y) = Nrnix;^{^i),E)M,iy;u)Pidfidu), (7.1) 



with S'c = {i' S [0,1]'^: ^ = 1} denoting the c— 1 dimensional simplex. Note that (7.1 ) is a generalization 



of (3.3) and (3.4) along the lines of the joint model proposed in Bhattacharya and Dunson [J], though 
they focus on kernels for predictors on models that accommodate non-Euclidean manifolds and there is no 
dimensionality reduction. 

When m is large it is often the case that most of the information present in the data is used to model 
the marginal of X while the association between X and Y is disregarded. In order to avoid this, we instead 
pick a few coordinates of X, say k many, and model the joint density of the k coordinates of X and Y. The 
remaining coordinates of X are modeled independently as equal variance Gaussians, though in preliminary 
simulation studies, we find that our performance in estimating the subspace and predicting Y is robust to 
the true joint distribution of the 'non-signal' predictors that are not predictive of Y. By setting a prior on 
the coordinate selection method, we can pick out those few 'important' coordinates which completely explain 
the conditional distribution of Y, very flexibly. Without loss of generality an isotropic transformation on X 
can be used which would provide some benefit with regards to coordinate inversion. That is, we can locate 
a k < m and Uq £ Vkm such that 



{U'oX,Y)^ h{xuy)^ Nk{xi;fi,J:o)M,{y;i^)P{dfidi^), xie^\ (7.2) 
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along with a d £ 5ft™ and V E Vm-k,m satisfying V'Uq = and O'Uq = 0, such that 

V'X ^ Nm-k{V'9,a^Irn-k) (7.3) 



independently of {UqX,Y). With such a structure, the joint distribution of {X,Y) becomes (7.1) where 



: gfjfc ^ sfj™, ^ Uoy + 0, Uoe Vk,„^, 9 e 3fJ", U'^O - 0, 
S = f/o(So - a24)C/^ + a^/^, Sq G M+{k),a'^ G 5ft+. 

The conditional density of F = y given X = x can be expressed as 



with parameters Q — {k,UQ,Yio, P,9,a-). A draw from the posterior of given model (7.1) will give us a 
draw from the posterior of the conditional. When P is discrete (which is a standard choice), the conditional 
distribution of Y given X and Q can be thought of as a weighted c dimensional multinomial probability 
vector with the weights depending on X only through the selected fc-dimensional coordinates UqX. For 
example, if P = J^^jLi Wj5(^,.^^.), then 

oo 

p{y\x-Q) ^Y.'^,{Ul,x)MrXy;u,) (7.5) 



where Wjix) = t»jJVfc(a,yj,So) a; G Sft'' for j — 1, . . . , oo. We refer to \7.b\ as the principal subspace 

classifier (PSC). 

The above is easily adapted to a regression setting by considering a low dimensional response F G 5ft' and 
replacing the multinomial kernel used for Y with a Gaussian kernel. In this setting the joint model becomes 



{X,Y)^ / 7V™(x;0(M),S,)iV,(2/;^,S^)F(d^d7/;), (7.6) 



which produces the following conditional model 



P[y\x;0)^ f rfjjT} V ^p^^ ^ i\ • '^^■^) 
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For a discrete P this conditional distribution becomes the foUowing mixture whose weights depend on X 
only through its fc-dimensional coordinates UqX 

oo 

p{y\x;Q) = Y,^,{U;,x)Ni{y;i;j,j:y). (7.8) 

As the regression model is a straightforward modification of the classifier, we focus on the classification case 
for sake of brevity. 

7.1 MCMC algorithm 

Sampling from the posterior of O = {k,UQ,TiQ,P,6 ^a"^) requires adjusting step 3 of Section 6's algorithm 
and adding a step to update u. We continue to assume P ^ DP{a, Pq). However, in the present setting 
Pq — N{m,S) (E) Dir{a^). Now the data likelihood, after introducing cluster labels Si, . . . , Sn, becomes 
nr=i '^siN„i{xi; UfiSi + So)Mc(?/i; i^Si)- An MCMC chain that provides draws from the joint posterior of 
Q can be obtained by adding the following two steps to the algorithm in Section 6. 

Step 3. Update Si for i = 1, 2, . . . , n by sampling from the following conditional posterior distribution 

c 

Pr{S, = cc w, exp {-\|2{^i'^l:^\i, - 2ii'^T.^^U'^x^i)} J] ' 

£=1 

for j = 1, . . . ,oo. Once again, one may introduce slice sampling latent variables and implement the 
exact block Gibbs sampler or use the block Gibbs sampler directly to make the total number of states 
finite. 

Step 9. Update the Vj's by sampling from ^ Dir{a\, . . . , a*), where = X]"=i ^iVi = i,Si = j] + ai for 

i=l,...,c. 

7.2 Simulation Study 

To demonstrate the performance of the classifier we conduct a small simulation study. Synthetic data sets 
are generated using two methods. The first method treats the PSC as a data generating mechanism, the 
second is similar to the data generating scheme found on page 16 of Hastie, Tibshirani and Freedman |16] 
(here after referred to as HTF). We briefly describe both. 



When the PSC is being used as a data generating mechanism, the X matrix is generated using (6.11. 
We set TO ~ 100, = 0.1, and fc = 2. As this produces a feature space with three clusters, Y takes on 
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values in {1,2,3} with probabilities [wi{UqX),'W2{UqX),W3{UqX)] where Uq is found in (6.2). The second 



data generating scenario consists of two classes with 100 observations each. The observations are drawn 



from the Gaussian mixture l/10^ioo('Tij, 1/5/). The 10 means, rrij, for the two classes are generated 



independently from -/Vioo('7i; I) and A^ioo(??2, 1) respectively (ryi and 772 are defined in (6.1 )). For each scenario 
100 data sets are generated. For the first, 100 training and 100 testing observations were generated and for 
the second 200 test and 200 training observations were used. The PSC, k nearest neighbor (KNN), and 
mixture discriminant analysis (MDA) were employed to classify the response from the testing data sets. 
KNN and MDA procedures were selected as competitors because KNN is an algorithmic based procedure 
that is known to perform well in a variety of settings (see HTF) and MDA is a flexible model based Gaussian 
mixture classifier (see Hastie and Tibshirani [14 ). We employ the knn 30 and mda [15 functions both of 
which are available freely from the R software [22] to implement the KNN and MDA methods. For the KNN 
we set fc = 6 for data generated from the PSC and A: = 25 for HTF data. These values were deemed to produce 
the smallest misclassification rate for a few synthetic data sets from both data generating scenarios. For the 
same reason, with regards to the MDA, the number of components for each classes Gaussian mixture was set 
at 5. Choosing k in this manner provides an advantage to KNN and MDA when comparing misclassification 
rates to the PSC. 

For the PSC, 1000 MCMC iterates were collected after a burn-in of 10,000 and thinning of 100. Con- 
vergence was assessed using history plots of the MCMC draws for a few data sets. The out of sample 
misclassification rates averaged over the 100 data sets can be found under each procedures respective head- 
ing in Table [2] 

Table 2: Misclassification rates from the simulation study. Data were generated using the PSC and the 
method detailed on page 16 of Hastie, Tibhshirani and Feedman (HTF) J£j 

Data Generating 

Mechanism PSC KNN MDA 

PSC 0.060 0.158 0.639 

HTF 0.047 0.269 0.369 



It appears as if the PSC is able to more accurately classify the categorical response from the testing data 
compared to KNN and MDA. This appears to be true regardless of what k is fixed to be. Preliminary studies 
indicated that the PSC classifier still out preformed KNN and MDA (though not as drastically) even with 
correlated and non-Gaussian non-signal predictors. 
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7.3 Illustration on Real Datasets 

We now apply the PSC to two real data sets both of which are readily available in R. The first consists of two 
classes and 7 quantitative predictors. The predictors are physiological measurements taken on Pima Indian 
women with the goal of predicting the presence or absence of diabetes. To these 7 predictors we add another 
93 which are comprised of random standard Gaussian draws. The dataset is split randomly into training 
and testing sections. The training section consists of 200 women, 68 of which are diagnosed with diabetes, 
while the testing section consists of 332 women, 109 of which are diagnosed with diabetes. 

The second data set we consider is the so called iris data set. Here the response consists of three classes 
each one representing a specific flower species. The four predictors are length and width measurements 
corresponding to the sepal and petal of a flower. The goal is to use these four measurements to predict the 
flower species. To the four predictors we add 96 that are comprised of random standard Gaussian draws. The 
data set consists of 150 observations with each flower species having 50. Fifty observations were randomly 
selected to comprise the testing data while the remaining 100 were used for the training data set. 

To both data sets we applied the PSC in addition to KNN classifier and a MDA classifier. For the KNN 
classifier, we chose the value of k that minimized the misclassification rate which turned out to be fc = 5 
for the iris data and fc = 24 for the diabetes data. Similarly, the number of components comprising the 
Gaussian mixtures of the MDA classifier was selected on the basis of minimizing the misclassification rate. 
The number of components turned out be 5 for the iris data and 7 for the diabetes data. Note that choosing 
k in this manner gives an unfair advantage to KNN and MDA relative to PSC, which does not use the test 
data at all in training. We fit the PSC to both data sets by collecting 1000 MCMC iterates after a burn-in of 
10,000 and thinning of 100. Convergence was monitored using trace plots from two chains that were started 
at different values. Prior to analysis variables were standardized. The misclassification rates can be found 
in Table m 

Table 3: Misclassification rates for the iris and diabetes data sets. 
Dataset PSC KNN MDA 

Iris 0.22 0.55 0.51 
Diabetes 0.26 0.29 0.37 



It appears that the PSC was able to classify the testing data response in the presence of a high dimensional 
feature space much more accurately than cither KNN or MDA. 
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8 Conclusions 



This article has proposed a novel methodology for nonparametric Bayesian learning of an afhne subspace 
underlying high-dimensional data. Clearly, massive-dimensional data are now commonplace and there is a 
need for flexible methods for dimensionality reduction that avoid parametric assumptions. In this context, 
the Bayesian paradigm has substantial advantages over commonly used machine learning, computer science 
and frequentist statistical methods that obtain a point estimate of the subspace or manifold which the data 
are concentrated near. As there is unavoidably substantial uncertainty in subspace or manifold learning, 
it is important to fully account for this uncertainty to avoid misleading inferences and obtain appropriate 
measures of uncertainty in estimating densities, performing predictions and identifying important predictors. 
We accomplish this in a Bayesian manner by placing a probability model over the space of affine subspaces, 
while developing a simple and efficient computational algorithm relying on Gibbs sampling to estimate the 
subspace and its dimension or model-average over subspaces of different dimension. The model is theoretically 
proved to be highly flexible and posterior consistency is achieved under appropriate prior choices. The 
proposed model and computational algorithm should be broadly useful beyond the density estimation and 
classification settings we have considered. 

A potential alternative to our approach mentioned in Section 1 is to use a mixture of sparse factor 
models to build a tangent space approximation to the manifold the data are concentrated near. Sparse 
Bayesian normal linear factor models are a successful approach for dimensionality reduction (Carvalho et 
al, [5; Bhattacharya and Dunson [3]), but make restrictive normality assumptions and are limited in their 
ability to reduce dimensionality by linearity assumptions. By mixing factor models, one can certainly obtain 
a more flexible characterization, but challenging computational issues arise in accommodating uncertainty 
in the number of factors and locations of zeros in the factor loadings matrix for each of the multivariate 
Gaussian components in the mixtures. Indeed, even in modest dimensions for a normal linear factor models. 
Lopes and West [21] encountered difficulties in efficiently inferring the number of factors, and recommending 
using a reversible jump MCMC algorithm that required a preliminary MCMC run for each choice of the 
number of factors. For mixture of factor models, one obtains a extremely rich over-parametrized black box. 
We propose a fundamentally new alternative that directly specifies an identifiable model based on geometry, 
while also developing an efficient Gibbs sampler that can infer the dimension of the subspace automatically 
without RJMCMC. Although our initial focus was on data in a Euclidean space, related models can be 
developed for non-Euclidean manifold data, as we will explore in ongoing work. 
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Appendices 

A Proofs 

As a reminder in what follows Br.m refers to the set {x G : < r}. For a subset V of densities and 
e > 0, the Li-metric entropy -/V(e,I?) is defined as the logarithm of the minimum number of e-sized (or 
smaller) Li subsets needed to cover V. 



A.l Proof of Lemma (3.3) 



Proof. Any density / in 2?^ can be expressed as J^^ Nm{i^, Yj)Q[dv) with E = UoTiqUq + (Tq(/„i — UqUq), 
Q — P o (f)^^, 0(a;) — Uqx, and {k,Uo,9,Tio,a, P) G H^. The assumption on 7r2 and will imply that E 
has all its eigen- values in [/i^j, A^]. 

We also claim that QiB"^^ ^) < e. To see that, note that \\(l)ifi)\\^ = + \\0\\^ < 2rl when- 

ever 11/^11 < r„ and ||6'|| < r„. Hence B^„,fe C <j)-\B^^^ J if ||6l|| < r„. Therefore e > P{B^r„,k) > 
P((0-i(Sy2.„.r„))1 ^P°^'^iB-^2r„ J for aU (P,0) e Ht,. Hence the claim follows. 

Therefore 

Vl.CVl.^if = j N^{,y,E)Q{d,y) : Q(i?^,,^,„) < e, A(S) € [hlA^]}, 

A(E) denoting the eigen-values of E. From Lemma 1 of Wu and Ghosal fS^*, it follows that N{e,'D'^) < 
C(rn/hn)™ and this completes the proof. □ 



A. 2 Proof of Lemma (3.4) 



The proof is similar in scope to the proof of Lemma 2 in Wu and Ghosal [35] ■ Throughout the proof, C will 
denote constant independent of n. 

Proof. Given k, U, 9, <j and /i = /ii, . . . , /i„ iid P, ^ (0(Mi)i ^) i * — 1, . . . , n, independently and are 
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independent of P. Hence 



Pr{PiB^r,,,k) > ^\k,X„) = E{Pr{PiB^,^^^,) > e\k, ^iJ\k,X„) . 

From 'JTl, given /i^ and k, for A C 3?'=^, P{A) - Beta{wkPk{A) + N{A),Wk{l - Pk) + n- N{A)) where 
-^(^) — -^{/JiSA}- Hence using the Markov inequahty, 

PriPiB'i f.) > e\k,fi < p ^ 

V V r„,k.j - I - ^(^n + Wk) 

Therefore 

P ( B^ ) 1 ^ 

^ ^ ^ ' 'i— 1 

Denote the above two terms as Ti and Then Ef^Ti = Ti — > as r„ — > oo. Under the marginal prior 
given fc, has an exchangable distribution 7r„(//^|fc) on (3?''')" (see fl?). Also since X„ are iid given ft, it 
follows that 

Now 

Pr(/xi e fc|A:,X„) < Pr(^i e fe,min(cr) > /i„|fc,X„) + 

Pr(min(cr) < /i„|/c,X„). 

The last term above converges to a.s. by the assumption on 7r2. Hence to complete the proof, it remains 
to show that 

Ef^ |Fr(^i e J,, min(cr) > hn\k, X„) } — > as n — oo. 

To compute the probability in above, we denote by 7ri„(/ii|/i_i, fc) the conditional distribution of /ii given 
= (/i2, . . . ,/in) J and by 7r_i„(/i_i|fc) the marginal distribution of fi^i under the joint 7r„. Then 

Pr(^i e B,=,^_fc,min(cr) > /i„|fc,X„) - A(X„)/P(X„) 
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where A^Kn) = 

Y\_Nm{Xi;(j){p), i;)d7ri„(/ii|/x_i, k)dTT^in{fi-i\k)dTri{UQ,9\k)dTr2{(T\k) 



i—1 

m.in(<T)>/iTi Ml II >''"ti 



and B(X„) 

A^„(Xi; (f>{fi), E)(i7ri„(/ii|/i_i, k)dTT-in{fi-i\k)dni{Uo,0\k)dTr2{cr\k). 

i=l 

We use Ef^{A{X^)/B{X.^)} < 

™P 4?$^ / .ftix)dx+ I ft{x)dx. (A.l) 



and upper bound the terms in above. 

First we upper bound A(X„) when \\Xi\\ < r„/2. We express iV„i(Xi; E) as 

Nk{U^X,;fi,,^o) 

and note that < r„/2, > r„ and /i„ < cTj < A Vj < fc imphes 



iVfe([/^Xi;Mi,Eo) < C/i-'=exp ""^ 



2 
n 

8A^' 



Therefore A(X„) < 



Ch-'^ exp ^ I (r7-2)^ exp ^(Xi - - UoU^){X, - 9) 

n 

Y[ Nm{Xf,(j,{fi,), j:)dTT_in{f^_i\k)dni{Uo,e\k)dTr2{(T\k). 



(A.2) 



Next we lower bound _B(X„) when Xi G B,^^/2,m- The conditional distribution 7ri„ can be expressed as 
Et2 + ^^,Pk (see M)- Hence i3(X„) > 



Wk + n 



I / n^m(-'^i;'/'(Aii),S)pfc(^i)d/^id7r_i„(/i_i|A:)d7ri(C/,6'|/c)(i7r2((T|fc). 
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Now 



where 



Nk(UQXi;fii,^o)pk{fii)dfii> I Nk{UgXi;fii,j:o)pk{fii)dfii 

S 



s = {^i^■.Y,af{u'kX,~-^,,)f<l}. 

1=1 

For /ii € 5, Nkip'^Xi, fii, Eo) > Hi "^7^^"^^^ ^^'^ Pfe(Mi) ^ ^kn with (5fe„ defined in the Lemma. Therefore 

k 

J A'^fc(C/oXi;^i,So)pfc(/ii)d^i > CSknY]_'^J^ J d-f^i = ^^kn 
and hence when < r„/2, i?(X„) > 



/_ — 1 " 

cxp ^(Xi - 0)'(/„, - C/ot/o)(^i - d)Y[N„,{X,; S) 



Combining (A.2) and (A.3|, we get 



A(X„) 
sup -HT^ 



<CnS-^K'^eM-rl/SA^) 



Plug this in ( |Al] ) to conclude £;/,{yl(X„)/-B(X„)} < 



(A.3) 



Cn5,->;'=exp(-r2/8A2) +Pry,(||X|| > r„/2) 



(A.4) 



which converges to zero by assumption. 

Under assumption Bl' and ^''^ < the sequence in (A.4) has a finite sum which results in 
the stronger conclusion. This completes the proof. □ 
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A.3 Proof of Corollary 



Proof. By Theorem 3.5 to show a.s. strong posterior consistency, we need to get positive sequences r„ and 
hn which satisfy 

ri"^(r„//i„)'" 0, ^r-2(i+")" < oo, and (A.S) 

OO 

nS^^h-'^ exp{-rl/8A') < oo, (A.6) 

ri=l 

and the prior probabihties Pr(||^?|| > r„|A:) and Pr(min(cr) < hn\k) decay exponentiaUy. Set r„ = n^/" and 
hn = rT^I^ . Then (A.5) is clearly satisfied. 



By the choice of p^, fc > 1, it is easy to check that (5a;„ > Cexp with C denoting positive constants 
independent of n all throughout. Then (A.6) is clearly satisfied because of the assumption t| > 4A^. 



Because 116*11° follows a Gamma distribution given k, k < m — 1, the probability Pr(||(?|| > r„|fc) can be 
upper bounded by C exp(— ArJJ) for some A > 0. This decays exponentially with r„ — n^/". 

Lastly it remains to check that Pr{mm{(T) < hn\k), decays exponentially. When the coordinates of cr are 
all equal, the probability can be upper bounded by C exp(— A/i^'') for some A > 0. This decays exponentially 
with hn = nT^/'^. In case the coordinates are iid, the probability can be upper bounded by Cnexp(— A/i~^) 
which also decays exponentially by the choice oi hn- □ 



A. 4 Proof of Theorem (4.1) 
Proof. Simplify /i as 



/i(i?, 6) = /i(P, B) + \\R -Rf + \\e- Of 

= f,iR, 0) + \\R-Rf + \m\' + R)i0 - 

>f,{R,e) + \\R-Rr + \\Re\\^. (A.7) 



Equality holds in (|A7| iff 6* = (/ - R)e. Then 

fi{R,e) = fc - Tr{(2P - e0')R} + C 

where k =Rank(i?) and C denotes something not depending on R, 9. From the proof of Proposition , 
given k one can show that the value of R minimizing /i above is ^j^'j ^^"^ ^^"^ minimizer is unique iff 
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Xk > \k+i- Then 

k 

Now one needs to find the k minimizing the above risk which is as mentioned. This completes the proof. □ 



A. 5 Proof of Theorem (4.2) 



Proof. The minimizer w ~ w is obvious. Then 

/2(C/,u)) ^\\U-Uf + C = ki- 2Tr{7(',^)C/(fc^) + C, 

ki being the rank of U and C symbohzing any constant not depending on U. For ki fixed, it is proved in 
Theorem 10.2|2] that the minimizer U is as in the theorem. It is unique iff is invertible. Plug that 

U and the risk function becomes, as a function of fci, 

/3(/ci) = fci-2Tr([7(',^ 1/2. 
We find the value of fci between 1 and m minimizing /a and set k = ki — 1. □ 
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